Scenario

Based on a real-life 2022 Singapore HDB resale price data set, your team is to construct a multiple regression model (for one particular district) that explains the HDB resale price (ResalePrice) in dollars using the given independent variables.

Marking Criteria

Documentation and Presentation: 10 marks

Methodology: 10 marks

R-codes, computer outputs interpretation and graphical explanations: 15 marks

Recommendations and conclusions: 15 marks

Format:

Written Report: PDF format, within 10 pages excluding the cover page and Appendix

Appendix: codes with computer outputs

You are required to provide detailed documentation of how you searched for your recommended model for inference purposes and to justify each step of your data analysis. You are also expected to provide model assumption justification and hypothesis-testing evidence (R code and computer outputs) with clear explanations showing that your recommended model is the best model among all the models considered according to the BIC criterion. Based on your final recommended model, state your recommendations and conclusions clearly.

Part 1 of Project

Load Data In

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Sengkang <- read.csv("data/Sengkang2023P.csv", stringsAsFactors = TRUE)

head(Sengkang, 5)
##      Date   Type Block           Street    Story Area   Model LeaseBegin        LeaseRemain
## 1 2022-01 3 ROOM  331C    ANCHORVALE ST 10 TO 12   67 Model A       2015 92 years 09 months
## 2 2022-01 3 ROOM  209C COMPASSVALE LANE 16 TO 18   67 Model A       2011 88 years 11 months
## 3 2022-01 3 ROOM  211C COMPASSVALE LANE 10 TO 12   68 Model A       2013 90 years 02 months
## 4 2022-01 3 ROOM  467B    FERNVALE LINK 04 TO 06   68 Model A       2016 93 years 08 months
## 5 2022-01 3 ROOM  414B    FERNVALE LINK 07 TO 09   68 Model A       2016  93 years 01 month
##   ResalePrice
## 1      420000
## 2      418000
## 3      410000
## 4      382000
## 5      410000

From the data above, we can see that some categorical variables have too many levels to use directly. We have to condense them accordingly.

length(unique(Sengkang$Block))
## [1] 521
length(unique(Sengkang$Street))
## [1] 29
length(unique(Sengkang$LeaseRemain))
## [1] 226

We can see from the code chunk above that Block has 521 levels, Street has 29 and LeaseRemain has 226 — too many levels to treat as categorical variables directly.

Data Wrangling

We will use the lubridate package to parse the dates and compute the lease years used as a numeric variable rather than a categorical one. Since years_used and LeaseRemain are perfectly correlated, we will drop LeaseRemain from the data frame.

In Sengkang, the HDB blocks are numbered 100+, 200+, 300+ and 400+ in Rivervale, Compassvale, Anchorvale and Fernvale respectively.

df <- Sengkang %>%
  mutate(Date = lubridate::ym(Date),
         LeaseBegin = lubridate::ym( paste0(LeaseBegin,"-01")),
         years_used = as.numeric((Date - LeaseBegin)/365),
         subzone = ifelse(grepl("^1", Block), "Rivervale",
                          ifelse(grepl("^2", Block), "Compassvale",
                                 ifelse(grepl("^3", Block), "Anchorvale",
                                        ifelse(grepl("^4", Block), "Fernvale", "others")))),
         .before = Street) %>%
  mutate(subzone= as.factor(subzone),
         Date = as.factor(Date)) %>%
  select(-LeaseRemain, -LeaseBegin)

Using LeaseRemain as a factor would generate too many binary variables; converting it into years_used is easier. The Date and LeaseBegin variables must be of date type before subtracting one from the other. The result of the subtraction is a duration (difftime) in days, so we coerce it to numeric and divide by 365 to express it in years.
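As a minimal, self-contained sketch of this conversion (toy dates, not the HDB data), subtracting two Date objects yields a difftime in days, which as.numeric() turns into a plain number:

```r
# Toy example: difference of two Dates is a difftime in days;
# dividing by 365 and coercing with as.numeric() gives years.
sale_date  <- as.Date("2022-01-01")
lease_date <- as.Date("2015-04-01")
years_used <- as.numeric((sale_date - lease_date) / 365)
round(years_used, 2)
```

Without as.numeric(), the result keeps the difftime class and prints with a days unit, which lm() would still accept but which clutters summaries.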

Find the relationship between independent variables and Resale Price

plot(df$Date, df$ResalePrice)

plot(df$Type, df$ResalePrice)

plot(df$Block, df$ResalePrice)

plot(df$years_used, df$ResalePrice)

plot(df$subzone, df$ResalePrice)

plot(df$Street, df$ResalePrice)

plot(df$Story, df$ResalePrice)

plot(df$Area, log(df$ResalePrice))

plot(df$Model, df$ResalePrice)

Run Regression

All variables

We will first calculate the BIC of the regression of all variables.

reg_all <- lm(ResalePrice ~ ., data = df)
BIC(reg_all)
## [1] 52085.02
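To see what BIC() is computing here, the formula is BIC = k·log(n) − 2·logLik, where k counts the estimated parameters (coefficients plus the error standard deviation). A self-contained check on toy data (not the HDB model):

```r
# Sketch: BIC() for an lm equals k*log(n) - 2*logLik,
# where k includes the intercept, slopes and sigma.
set.seed(1)
toy <- data.frame(x = 1:20)
toy$y <- 3 + 2 * toy$x + rnorm(20)
fit <- lm(y ~ x, data = toy)
n <- nrow(toy)
k <- attr(logLik(fit), "df")          # here 3: intercept, slope, sigma
manual_bic <- k * log(n) - 2 * as.numeric(logLik(fit))
all.equal(manual_bic, BIC(fit))       # TRUE
```

The log(n) penalty is why BIC punishes high-level factors such as Block more heavily than AIC would.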

Residual check

r <- residuals(reg_all)
plot(df$Date, r,
     xlab = "Date", ylab = "Residuals")

plot(df$Type, r,
     xlab = "Type", ylab = "Residuals")

plot(df$Block, r,
     xlab = "Block", ylab = "Residuals")

plot(df$years_used, r,
     xlab = "years_used", ylab = "Residuals")

plot(df$subzone, r,
     xlab = "subzone", ylab = "Residuals")

plot(df$Street, r,
     xlab = "Street", ylab = "Residuals")

plot(df$Story, r,
     xlab = "Story", ylab = "Residuals")

plot(df$Area, r,
     xlab = "Area", ylab = "Residuals")

plot(df$Model, r,
     xlab = "Model", ylab = "Residuals")

From the residual plots above, we notice that all points are scattered around residual = 0. This suggests that the model assumptions are not violated. We now have three location columns: Block, Street and subzone. There should be high correlation among these three variables, so we will compare the BIC after dropping each one and pick the model with the lowest BIC.

Removing Block

L1 <- lm(ResalePrice ~ .-Block, data = df)

Removing Street

L2 <- lm(ResalePrice ~ .-Street, 
         data = df)

Removing subzone

L3 <- lm(ResalePrice ~ .-subzone, 
         data = df)

It would seem that removing Block is best. Now we will compare Street and subzone.

Removing Block and Street

L4 <- lm(ResalePrice ~ .-Block-Street, 
         data = df)

Removing Block and subzone

L5 <- lm(ResalePrice ~ .-Block-subzone, data = df)

Comparing BIC

BIC_location <- data.frame(lm = c(".-Block",".-Street",".-subzone",".-Block-Street",".-Block-subzone"),
                              BIC = c(BIC(L1),BIC(L2),BIC(L3),BIC(L4),BIC(L5)))
BIC_location
##                lm      BIC
## 1         .-Block 49895.90
## 2        .-Street 52085.02
## 3       .-subzone 52085.02
## 4  .-Block-Street 50966.23
## 5 .-Block-subzone 49941.80

Visualize L1

plot(L1)

Interaction Variables

Note: We will use L1 as the main regression and add terms to it. We know that the model is linear, as checked under the residual plot assumptions above.
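As a self-contained toy sketch (illustrative variable names, not the HDB data) of how an interaction term is added to an lm() formula and compared by BIC:

```r
# Toy sketch: `*` expands to main effects plus the interaction,
# so price ~ area * type fits area, type and area:type.
set.seed(2)
toy <- data.frame(area = runif(50, 60, 120),
                  type = factor(sample(c("3 ROOM", "4 ROOM"), 50, replace = TRUE)))
toy$price <- 1000 * toy$area +
  500 * toy$area * (toy$type == "4 ROOM") +   # slope differs by type
  rnorm(50, sd = 50)
base_fit <- lm(price ~ area + type, data = toy)
int_fit  <- lm(price ~ area * type, data = toy)
c(BIC(base_fit), BIC(int_fit))   # lower BIC is preferred
```

An interaction is only worth keeping if the drop in BIC outweighs the extra log(n) penalty per added coefficient.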

L6 <- lm(ResalePrice ~ .-Block,
         data = df)
summary(L6)
## 
## Call:
## lm(formula = ResalePrice ~ . - Block, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125199  -18458    -391   16605  127493 
## 
## Coefficients: (1 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             210695.6    22823.2   9.232  < 2e-16 ***
## Date2022-02-01            3068.6     3127.6   0.981 0.326637    
## Date2022-03-01            9561.5     3012.4   3.174 0.001525 ** 
## Date2022-04-01           19707.9     2957.8   6.663 3.43e-11 ***
## Date2022-05-01           22649.6     3022.6   7.493 9.91e-14 ***
## Date2022-06-01           30197.7     3101.4   9.737  < 2e-16 ***
## Date2022-07-01           34030.9     2899.1  11.738  < 2e-16 ***
## Date2022-08-01           35921.4     2959.2  12.139  < 2e-16 ***
## Date2022-09-01           46776.8     2968.3  15.759  < 2e-16 ***
## Date2022-10-01           54208.3     3198.9  16.946  < 2e-16 ***
## Date2022-11-01           56573.0     3172.8  17.831  < 2e-16 ***
## Date2022-12-01           63968.7     3084.1  20.742  < 2e-16 ***
## Type4 ROOM               49589.9     8541.6   5.806 7.41e-09 ***
## Type5 ROOM              111718.7    14907.3   7.494 9.85e-14 ***
## years_used               -7932.9      170.6 -46.503  < 2e-16 ***
## subzoneCompassvale       41771.1     8999.2   4.642 3.67e-06 ***
## subzoneFernvale         -19184.9     4668.2  -4.110 4.12e-05 ***
## subzoneRivervale        -34282.0     8746.1  -3.920 9.16e-05 ***
## StreetANCHORVALE DR      31753.6     6185.3   5.134 3.11e-07 ***
## StreetANCHORVALE LANE   -12441.5     7709.0  -1.614 0.106706    
## StreetANCHORVALE LINK    12221.1     4386.9   2.786 0.005388 ** 
## StreetANCHORVALE RD       -361.1     4641.1  -0.078 0.937990    
## StreetANCHORVALE ST       4879.3     5521.0   0.884 0.376923    
## StreetCOMPASSVALE BOW    49697.4    10253.2   4.847 1.35e-06 ***
## StreetCOMPASSVALE CRES  -35722.9     9568.3  -3.733 0.000194 ***
## StreetCOMPASSVALE DR     31509.0     9913.0   3.179 0.001502 ** 
## StreetCOMPASSVALE LANE  -31117.6     9936.8  -3.132 0.001763 ** 
## StreetCOMPASSVALE LINK   65221.1    10239.0   6.370 2.33e-10 ***
## StreetCOMPASSVALE RD      5531.1    10588.8   0.522 0.601478    
## StreetCOMPASSVALE ST    -34202.2    10359.9  -3.301 0.000978 ***
## StreetCOMPASSVALE WALK    2385.9    10932.0   0.218 0.827260    
## StreetFERNVALE LANE      11616.4     7948.8   1.461 0.144058    
## StreetFERNVALE LINK       8008.5     4145.7   1.932 0.053525 .  
## StreetFERNVALE RD        19886.0     4575.1   4.347 1.45e-05 ***
## StreetFERNVALE ST         6151.8     6283.4   0.979 0.327671    
## StreetJLN KAYU            6159.1     9980.8   0.617 0.537240    
## StreetRIVERVALE CRES     10162.6     9429.7   1.078 0.281284    
## StreetRIVERVALE DR       46191.2     9828.5   4.700 2.78e-06 ***
## StreetRIVERVALE ST       48896.7    11317.6   4.320 1.63e-05 ***
## StreetRIVERVALE WALK     73305.2    11396.4   6.432 1.56e-10 ***
## StreetSENGKANG CTRL      70136.4    10972.0   6.392 2.02e-10 ***
## StreetSENGKANG EAST AVE  66487.3    10535.5   6.311 3.39e-10 ***
## StreetSENGKANG EAST RD    1205.1    13776.8   0.087 0.930305    
## StreetSENGKANG EAST WAY  46129.6     6596.5   6.993 3.62e-12 ***
## StreetSENGKANG WEST AVE  31509.1     9286.4   3.393 0.000704 ***
## StreetSENGKANG WEST WAY       NA         NA      NA       NA    
## Story04 TO 06            25592.7     2314.5  11.058  < 2e-16 ***
## Story07 TO 09            42931.7     2287.7  18.767  < 2e-16 ***
## Story10 TO 12            48919.7     2365.3  20.682  < 2e-16 ***
## Story13 TO 15            57804.1     2288.3  25.261  < 2e-16 ***
## Story16 TO 18            66014.4     2852.9  23.139  < 2e-16 ***
## Story19 TO 21            62481.2     4257.8  14.674  < 2e-16 ***
## Story22 TO 24            70526.8     4802.6  14.685  < 2e-16 ***
## Story25 TO 27            64799.3     6652.8   9.740  < 2e-16 ***
## Area                      2752.4      327.1   8.413  < 2e-16 ***
## ModelModel A             10424.3     2968.2   3.512 0.000454 ***
## ModelModel A2            52537.4     5374.8   9.775  < 2e-16 ***
## ModelPremium Apartment   20248.3     2601.4   7.784 1.11e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29140 on 2059 degrees of freedom
## Multiple R-squared:  0.8887, Adjusted R-squared:  0.8857 
## F-statistic: 293.6 on 56 and 2059 DF,  p-value: < 2.2e-16
BIC(L6)
## [1] 49895.9
library(fitdistrplus)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: survival
fnorm <- fitdist(residuals(L1), "norm")
## Warning in sqrt(diag(varcovar)): NaNs produced
## Warning in sqrt(1/diag(V)): NaNs produced
## Warning in cov2cor(varcovar): diag(.) had 0 or NA entries; non-finite result is doubtful
result <- gofstat(fnorm, discrete = FALSE)
result
## Goodness-of-fit statistics
##                              1-mle-norm
## Kolmogorov-Smirnov statistic 0.03274423
## Cramer-von Mises statistic   0.67139103
## Anderson-Darling statistic   4.71427005
## 
## Goodness-of-fit criteria
##                                1-mle-norm
## Akaike's Information Criterion   49455.78
## Bayesian Information Criterion   49467.09
KScritvalue <- 1.36/sqrt(length(df$Date))
KScritvalue
## [1] 0.02956522
summary(fnorm)
## Fitting of the distribution ' norm ' by maximum likelihood 
## Parameters : 
##           estimate Std. Error
## mean -1.540467e-12        NaN
## sd    2.874631e+04   102.2332
## Loglikelihood:  -24725.89   AIC:  49455.78   BIC:  49467.09 
## Correlation matrix:
##      mean  sd
## mean    1 NaN
## sd    NaN   1
plot(fnorm)

Since the KS statistic = 0.032744 > 0.029566 = the critical value, we reject the null hypothesis: the data provide sufficient evidence that the normal model is not an appropriate model for the residuals.
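For context, 1.36/sqrt(n) is the large-sample 5% critical value of the Kolmogorov–Smirnov statistic. A self-contained check on simulated normal data (not the actual residuals) shows the statistic falling below that cutoff; note that because the mean and sd here were estimated from the residuals, this cutoff is conservative (a Lilliefors correction would be stricter):

```r
# Simulated check of the KS normality test (toy data, not the residuals).
# 1.36/sqrt(n) approximates the 5% critical value for large n.
set.seed(3)
n <- 2116                        # same sample size as the data set
x <- rnorm(n)
ks <- ks.test(x, "pnorm")
crit <- 1.36 / sqrt(n)
unname(ks$statistic) < crit      # TRUE for genuinely normal data
```

That the fitted residuals exceed this conservative cutoff strengthens the conclusion that they are not normally distributed.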

residual_plot <- function(x, regression){
  n <- names(x)
  r <- residuals(regression)
  for (i in seq_along(x)){
    # x$n[i] would look for a column literally named "n";
    # use [[ ]] to index by the stored column name instead
    plot(x[[n[i]]], r, xlab = n[i], ylab = "Residuals")
  }
}
residual_plot(df, L1)